
Conversation

@zkh2016 (Contributor) commented Aug 17, 2021

PR types

New features

PR changes

OPs

Describe

Fuses elementwise_add, dropout, and elementwise_add into one operator:

// before fusion
out1 = elementwise_add(src, bias)
out2 = dropout(out1)
out3 = elementwise_add(residual, out2)
// after fusion
out = fused_residual_dropout_bias(src, residual, bias)
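As a rough illustration of the per-element computation the fused op performs, here is a simplified sketch with hypothetical names (not the merged kernel): for brevity it reads pre-generated uniform randoms instead of calling curand, and it assumes upscale-in-train dropout.

__global__ void FusedResidualDropoutBiasSketch(
    const float *src, const float *residual, const float *bias,
    const float *rand, float *out, uint8_t *mask,
    const float dropout_prob, const int rows, const int cols) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= rows * cols) return;
  float val = src[idx] + bias[idx % cols];       // out1 = elementwise_add(src, bias)
  uint8_t keep = rand[idx] >= dropout_prob;      // dropout mask
  float dropped =
      keep ? val / (1.0f - dropout_prob) : 0.0f; // out2 = dropout(out1)
  mask[idx] = keep;
  out[idx] = residual[idx] + dropped;            // out3 = elementwise_add(residual, out2)
}

The fusion replaces three kernel launches (and two round trips of the intermediate tensors through global memory) with a single pass over the data.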

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@CLAassistant commented Aug 17, 2021

CLA assistant check: all committers have signed the CLA.

@xingfeng01 (Contributor)

LGTM

xingfeng01 previously approved these changes Aug 24, 2021
xingfeng01 previously approved these changes Aug 25, 2021

@xingfeng01 (Contributor)

LGTM

xingfeng01 previously approved these changes Aug 26, 2021

@xingfeng01 (Contributor)

LGTM

@Xreki (Contributor) left a comment

Consider whether the implementation can be further encapsulated to increase code reuse.

* @brief the fused function called by every thread
*/
template <typename T, typename MaskType, typename U, int VecSize,
bool layer_norm>
Contributor: layer_norm -> ComputeLayerNorm

Author: done

const platform::CUDADeviceContext &ctx) {
// dropout_prob == 1.0f
if (std::abs(dropout_prob - 1.0f) < 1e-5) {
PADDLE_ENFORCE_CUDA_SUCCESS(
Contributor: Use memory::Copy. Also, could dst and residual be the same address?
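For context, a hedged sketch of the memory::Copy form being suggested; `place` (a platform::CUDAPlace) is an assumed variable, not from the diff:

// equivalent of the cudaMemcpyAsync call via Paddle's wrapper
memory::Copy(place, dst, place, residual, rows * cols * sizeof(T), ctx.stream());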

Author: done

PADDLE_ENFORCE_CUDA_SUCCESS(
cudaMemcpyAsync(dst, residual, rows * cols * sizeof(T),
cudaMemcpyDeviceToDevice, ctx.stream()));
PADDLE_ENFORCE_CUDA_SUCCESS(cudaMemsetAsync(
Contributor: Use math::SetConstant.
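For reference, a hedged usage sketch of the suggested functor; the tensor name is hypothetical:

// zero-fill via Paddle's functor instead of raw cudaMemsetAsync
math::SetConstant<platform::CUDADeviceContext, MaskType> set_zero;
set_zero(ctx, &mask_tensor, static_cast<MaskType>(0));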

const int VecSize = 4;
if (dbias != nullptr) {
int real_vec_size = VecSize;
if (cols % VecSize != 0) real_vec_size = 1;
Contributor: Don't put this on one line.

Author: done


template <typename T>
struct TestFusedResidualDropoutBias {
uint32_t _rows;
Contributor: Name class member variables with a trailing underscore (xxx_); for a struct the underscore suffix can be omitted.

Author: done

#include <curand_kernel.h>

#include <iostream>
#include <memory>
Contributor: These two headers aren't used, are they?

Author: done

@@ -0,0 +1,70 @@
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Contributor: The file name fused_dropout.h is not appropriate.

Author: Renamed it to fused_dropout_common.h.

const platform::CUDADeviceContext &ctx, const uint64_t n) {
const uint64_t tmp_n = n / VecSize;
int threads = std::max(
(uint64_t)32, std::min(tmp_n, (uint64_t)ctx.GetMaxThreadsPerBlock()));
Contributor: Use static_cast for type conversions.
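A minimal sketch of the suggested change, keeping the variable names from the snippet above:

int threads = static_cast<int>(
    std::max(static_cast<uint64_t>(32),
             std::min(tmp_n, static_cast<uint64_t>(ctx.GetMaxThreadsPerBlock()))));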

/**
* get 1D threads and blocks
*/
template <int VecSize = 4>
Contributor: VecSize can be passed as a function argument here. Also, consider reusing the existing data structures and interfaces:

struct GpuLaunchConfig {
dim3 theory_thread_count = dim3(1, 1, 1);
dim3 thread_per_block = dim3(1, 1, 1);
dim3 block_per_grid = dim3(1, 1, 1);
int compute_capability = 0;
};
inline GpuLaunchConfig GetGpuLaunchConfig1D(
const platform::CUDADeviceContext& context, int64_t element_count,
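For illustration, a hedged usage sketch of the quoted interface; SomeKernel and its arguments are placeholders, and the vectorized element count is passed as element_count:

platform::GpuLaunchConfig config =
    platform::GetGpuLaunchConfig1D(ctx, static_cast<int64_t>(n / VecSize));
SomeKernel<T, VecSize><<<config.block_per_grid, config.thread_per_block, 0,
                         ctx.stream()>>>(/* kernel arguments */);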

Author: done

}

// aligned vector generates vectorized load/store on CUDA
template <typename T, int VecSize>
Contributor: No need to define this again; you can reuse the existing implementation directly:

template <typename T, int Size>
struct alignas(sizeof(T) * Size) CudaAlignedVector {
T val[Size];
};
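For illustration, a hedged sketch of the vectorized load/store pattern this struct enables (variable names are hypothetical; the same reinterpret_cast idiom appears in the diff):

using LoadT = CudaAlignedVector<T, VecSize>;
// one aligned vector load instead of VecSize scalar loads
LoadT src_vec = *reinterpret_cast<const LoadT *>(&src[i]);
LoadT out_vec;
#pragma unroll
for (int j = 0; j < VecSize; j++) {
  out_vec.val[j] = src_vec.val[j];  // per-element work goes here
}
// one aligned vector store
*reinterpret_cast<LoadT *>(&dst[i]) = out_vec;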

Author: done

*
*/

/********Forward**************/
Contributor: The comments at L27 - L32 add nothing; I suggest deleting them.

Author: done


namespace platform = paddle::platform;
namespace cg = cooperative_groups;
namespace memory = paddle::memory;
Contributor: These namespace aliases are unnecessary; the code can already use platform::float16 and memory::Copy directly.

Author: done

if (bias != nullptr)
*bias_value = *reinterpret_cast<const LoadT *>(&bias[col_id]);

float4 rand = curand_uniform4(state);
Contributor: If VecSize is 1, is there a problem here, or poor performance?

Contributor: Regardless of what VecSize is, four random numbers are generated here; that doesn't seem right either.

Author (@zkh2016, Sep 2, 2021): Fixed.

}
}

/********Backward**************/
Contributor: Don't use comments like this.

Author: done

* 2. save 128*8 temporary sum in 8*128 shared memory
* 3. reduce the sum of 128 rows data by 8*VecSize warps
*/
template <typename T, typename MaskType, int BLOCK_SIZE_X, int BLOCK_SIZE_Y,
Contributor: BLOCK_SIZE_X -> BlockSizeX

Author: done

}
}

// save temporary sum to cache and do transpose
Contributor: Is this for a block-level reduce? Can you call this existing implementation instead:

template <typename T, typename ReduceOp>
__device__ __forceinline__ T BlockXReduce(T val, ReduceOp reducer) {

Author: It's a bit different: here I do an 8-way reduce inside one block. The block is 8*128 threads; each group of 128 threads reduces one column of data, producing 8 sums.
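A minimal sketch of the layout the author describes, with hypothetical kernel and variable names and assuming float data: a dim3(8, 128) block in which threadIdx.x selects one of 8 columns and the 128 threads along y reduce that column through shared memory.

__global__ void ColumnSum8Way(const float *in, float *out, int rows, int cols) {
  // 8 columns per block, 128 partial sums per column; +1 pads against bank conflicts
  __shared__ float cache[128][8 + 1];
  int col = blockIdx.x * 8 + threadIdx.x;
  float sum = 0.0f;
  if (col < cols) {
    for (int row = threadIdx.y; row < rows; row += 128) {
      sum += in[row * cols + col];  // each thread strides over its column
    }
  }
  cache[threadIdx.y][threadIdx.x] = sum;
  __syncthreads();
  // tree-reduce the 128 partials of each of the 8 columns
  for (int stride = 64; stride > 0; stride >>= 1) {
    if (threadIdx.y < stride) {
      cache[threadIdx.y][threadIdx.x] += cache[threadIdx.y + stride][threadIdx.x];
    }
    __syncthreads();
  }
  if (threadIdx.y == 0 && col < cols) {
    out[col] = cache[0][threadIdx.x];  // 8 sums per block, one per column
  }
}
// launched with dim3 grid((cols + 7) / 8), block(8, 128)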

};

TEST(FusedDropout, GPUFusedResidualDropoutBias) {
const int rows = 16;
Contributor: The different test configurations could be written into a single TEST via a for loop.

Author: done

@Xreki requested a review from zhangting2020 September 1, 2021 08:51
@Xreki (Contributor) commented Sep 1, 2021

@zhangting2020 please help review this as well.

is_test = false;
hasbias = true;
platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
auto devicectx = pool.Get(place);
Contributor: Naming suggestions: hasbias -> has_bias, devicectx -> device_ctx.

Author: done

rows, cols, dropout_prob, is_upscale_in_train, src, residual, bias,
dst);
}
}
Contributor: In the calls above, the if (cols % VecSize != 0) check seems to affect only how VecSize is set. Is there a better way to avoid this duplicated code?

Author: VecSize is a template parameter; I haven't found another way.
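A hedged sketch of one common way to contain that duplication (the launcher names are hypothetical): keep VecSize as a template parameter, but route the runtime check through a single dispatch point.

template <typename T, int VecSize>
void LaunchResidualDropoutBias(const int rows, const int cols /*, ... */);

template <typename T>
void DispatchResidualDropoutBias(const int rows, const int cols /*, ... */) {
  // the cols % VecSize check lives in exactly one place
  if (cols % 4 == 0) {
    LaunchResidualDropoutBias<T, 4>(rows, cols /*, ... */);
  } else {
    LaunchResidualDropoutBias<T, 1>(rows, cols /*, ... */);
  }
}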


/**
* @brief call paddle dropout op
*/
Contributor: Is this test for the original dropout op? I see there is also a test for fuse_dropout_op, so I don't quite understand the purpose of this unit test.

Author: This header file is shared by several dropout-related unit tests; it calls dropout_op to provide the baseline version for comparison.

@Xreki (Contributor) left a comment

LGTM. Some code-optimization suggestions; they can be addressed in a follow-up PR.

*/
inline platform::GpuLaunchConfig Get1DBlocksAnd2DGrids(
const platform::CUDADeviceContext &ctx, const uint32_t rows,
const uint32_t cols, const int VecSize) {
Contributor: Name variables in xxx_xxx style: VecSize -> vec_size.

}

__forceinline__ __device__ void RandVec(curandStatePhilox4_32_10_t *state,
float *data, const int VecSize) {
Contributor: Same as above.

return config;
}

__forceinline__ __device__ void Rand1(curandStatePhilox4_32_10_t *state,
Contributor: I think writing this as a template function with specializations would be better.
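A hedged sketch of the template-with-specializations shape being suggested (not necessarily the merged code); curand_uniform and curand_uniform4 are the standard cuRAND device calls:

template <int VecSize>
__forceinline__ __device__ void RandVec(curandStatePhilox4_32_10_t *state,
                                        float *data);

template <>
__forceinline__ __device__ void RandVec<1>(curandStatePhilox4_32_10_t *state,
                                           float *data) {
  data[0] = curand_uniform(state);  // one uniform float
}

template <>
__forceinline__ __device__ void RandVec<4>(curandStatePhilox4_32_10_t *state,
                                           float *data) {
  float4 rand = curand_uniform4(state);  // four uniform floats in one call
  data[0] = rand.x;
  data[1] = rand.y;
  data[2] = rand.z;
  data[3] = rand.w;
}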

T factor = static_cast<T>(1.0f / (1.0f - dropout_prob));
if (!is_upscale_in_train) {
factor = static_cast<T>(1.0f);
}
Contributor: L109 - L112 can be written on one line with the ?: operator.
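For illustration, the one-line form being suggested, using the variables from the snippet above:

T factor = is_upscale_in_train ? static_cast<T>(1.0f / (1.0f - dropout_prob))
                               : static_cast<T>(1.0f);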

factor = static_cast<T>(1.0f - dropout_prob);
if (is_upscale_in_train) {
factor = static_cast<T>(1.0f);
}
Contributor: L114 - L117 can likewise be written on one line with the ?: operator.

test.has_bias = has_bias[j];
test.Run();
test.CheckOut(default_diff);
if (!is_fp16) {
Contributor: Doesn't fp16 check gradients?

T default_diff = static_cast<T>(1e-5);
if (is_fp16) {
default_diff = static_cast<T>(1e-2);
}
Contributor: L258 - 260 can also be simplified to one line.

default_diff = static_cast<T>(1e-2);
}
for (int i = 0; i < cols_list.size(); i++) {
for (int j = 0; j < 2; j++) {
Contributor: This can be written as follows:

for (auto col : {16, 17}) {
  for (auto has_bias : {true, false}) {
    ...
  }
}

TEST(FusedDropout, GPUFusedResidualDropoutBias3) {
const int rows = 16;
const int cols = 16;
TestFusedResidualDropoutBias<float> test(rows, cols, 0, 1.0, true, false);
Contributor: GPUFusedResidualDropoutBias2 and GPUFusedResidualDropoutBias3 differ only in one parameter setting; they can be merged into one.

}

// test large shape
TEST(FusedDropout, GPUFusedResidualDropoutBias6) {
Contributor: Make the unit test names more concrete, indicating what is actually being tested.

@Xreki merged commit cf8bf03 into PaddlePaddle:develop Sep 9, 2021
AnnaTrainingG pushed a commit to AnnaTrainingG/Paddle that referenced this pull request Sep 29, 2021